NSF PAR Search | NSF Public Access Repository

Skyhook: Towards an Arrow-Native Storage System

https://doi.org/10.1109/CCGrid54584.2022.00017

Chakraborty, Jayjeet; Jimenez, Ivo; Rodriguez, Sebastiaan Alvarez; Uta, Alexandru; LeFevre, Jeff; Maltzahn, Carlos (May 2022, The 22nd IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGrid22))

With ever-increasing dataset sizes, several file formats such as Parquet, ORC, and Avro have been developed to store data efficiently and to save network and interconnect bandwidth at the price of additional CPU utilization. However, with the advent of networks supporting 25-100 Gb/s and storage devices delivering 1,000,000 reqs/sec, the CPU has become the bottleneck, struggling to keep up with feeding data in and out of these fast devices. The result is that data access libraries executed on single clients are often CPU-bound and cannot utilize the scale-out benefits of distributed storage systems. One attractive solution to this problem is to offload data-reducing processing and filtering tasks to the storage layer. However, modifying legacy storage systems to support compute offloading is often tedious and requires an extensive understanding of the system internals. Previous approaches re-implemented functionality of data processing frameworks and access libraries for a particular storage system, a duplication of effort that might have to be repeated for different storage systems. This paper introduces a new design paradigm that allows extending programmable object storage systems to embed existing, widely used data processing frameworks and access libraries into the storage layer with no modifications. In this approach, data processing frameworks and access libraries can evolve independently from storage systems while leveraging distributed storage systems’ scale-out and availability properties. We present Skyhook, an example implementation of our design paradigm using Ceph, Apache Arrow, and Parquet. We provide a brief performance evaluation of Skyhook and discuss key results.
Full Text Available
The CROSS Incubator: A Case Study for funding and training RSEs

https://doi.org/10.48550/ARXIV.2012.01144

Lieggi, Stephanie; Jimenez, Ivo; LeFevre, Jeff; Maltzahn, Carlos (November 2020, RSE-HPC – Introduction: Research Software Engineers in HPC: Creating Community, Building Careers, Addressing Challenges)

The incubator and research projects sponsored by the Center for Research in Open Source Software (CROSS, cross.ucsc.edu) at UC Santa Cruz have been very effective at promoting the professional and technical development of research software engineers. Carlos Maltzahn founded CROSS in 2015 with a generous gift of $2,000,000 from UC Santa Cruz alumnus Dr. Sage Weil and founding memberships of Toshiba America Electronic Components, SK Hynix Memory Solutions, and Micron Technology. Over the past five years, CROSS funding has enabled PhD students to not only create research software projects but also learn how to draw in new contributors and leverage established open source software communities. This position paper will present CROSS fellowships as case studies for how university-led open source projects can create a real-world, reproducible model for effectively training, funding and supporting research software engineers.
Full Text Available
Is Big Data Performance Reproducible in Modern Cloud Networks?

Uta, Alexandru; Custura, Alexandru; Duplyakin, Dmitry; Jimenez, Ivo; Rellermeyer, Jan; Maltzahn, Carlos; Ricci, Robert; Iosup, Alexandru (February 2020, Proceedings of the Seventeenth USENIX Symposium on Networked Systems Design and Implementation (NSDI))

Performance variability has been acknowledged as a problem for over a decade by cloud practitioners and performance engineers. Yet, our survey of top systems conferences reveals that the research community regularly disregards variability when running experiments in the cloud. Focusing on networks, we assess the impact of variability on cloud-based big-data workloads by gathering traces from mainstream commercial clouds and private research clouds. Our data collection consists of millions of datapoints gathered while transferring over 9 petabytes of data. We characterize the network variability present in our data and show that, even though commercial cloud providers implement mechanisms for quality-of-service enforcement, variability still occurs, and is even exacerbated by such mechanisms and service provider policies. We show how big-data workloads suffer from significant slowdowns and lack predictability and replicability, even when state-of-the-art experimentation techniques are used. We provide guidelines for practitioners to reduce the volatility of big data performance, making experiments more repeatable.
Free, publicly-accessible full text available February 1, 2030
